scycle
  • Home
  • Basic pipeline
    • Introduction
    • Pre-processing
    • Dimensionality reduction
    • Trajectory and cell division
    • Pseudotime
    • Cell cycle and RNA velocity
  • Integration of cell lines
scycle
  • Docs »
  • Basic pipeline

Basic pipeline¶

Introduction¶

We'll get started importing scycle itself, and from there using single-cell RNA-seq from an Ewing Sarcoma cell line named CHLA9 [ref for original work]. scycle accepts AnnData objects as imputs. In this case, we internally download and the data from a loom file, which will also allow us to showcase the integration of the pipeline with scvelo and RNA velocity.

In [1]:
import scycle as cc
In [2]:
chla9 = cc.data.get_data('CHLA9')
-- Loading data from cache...
In [3]:
chla9
Out[3]:
AnnData object with n_obs × n_vars = 5061 × 60662
    obs: 'TotalUMIs'
    var: 'Accession', 'AccessionVersion', 'Aliases', 'CcdsID', 'Chromosome', 'ChromosomeEnd', 'ChromosomeStart', 'CosmicID', 'DnaBindingDomain', 'FullName', 'GeneType', 'HgncID', 'IsTF', 'Location', 'LocationSortable', 'LocusGroup', 'LocusType', 'MgdID', 'MirBaseID', 'OmimID', 'PubmedID', 'RefseqID', 'RgdID', 'UcscID', 'UniprotID', 'VegaID'
    layers: 'matrix', 'spliced', 'unspliced'

Pre-processing¶

Though samples can be pre-processed by other pipelines before using scycle, we implemented our own pre-processing pipeline. In the filter_cells stage, only quality control attempting to deal with low-quality cells is done. We also have integration with DoubletDetection to perform doublet removal inside of this step.

In [4]:
cc.pp.filter_cells(chla9)
3851 cells pass the count filter
4599  cells pass the mt filter
Cells selected 3813

Then, the standard preprocessing recipe for scycle is called prep_pooling. This recipe includes an unconventional step that we find improves in the annotation of cell cycling: pooling (i.e. smoothing) cells using their nearest neighbors in a reference space. Otherwise, this step also performs library size normalization, log2-transformation and selects the most highly varying genes in the dataset.

In [5]:
cc.pp.prep_pooling(chla9)
Preparing embedding...
WARNING: genes are not in var_names and ignored: ['HIST1H4C', 'HIST1H1E', 'HIST2H2AC', 'HIST1H1C']
Embedding for pooling...
Pooling 3813 cells...
Scoring cell cycle...
WARNING: genes are not in var_names and ignored: ['HIST1H4C', 'HIST1H1E', 'HIST2H2AC', 'HIST1H1C']

An optional step in the pipeline that may improve downstream analysis, particularly finding the moment of cell division in the trajectory, is to normalize counts with reference to the mean total counts by partition. This is performed in a single-step in scycle in the pp module:

In [6]:
cc.pp.normalize_by_partition(chla9, n_ref_parts = 10)
Preprocessing reference with default values...
FastICA from sklearn did not converge due to numerical instabilities - Retrying...
-- Running `tl.self_consistent_trajectory` with n_ref_parts...
-- Trajectory not not computed: computing it with default parameters.
WARNING: The initial number of nodes must be at least 3. This will be fixed
WARNING: The initial number of nodes must be at least 3. This will be fixed
WARNING: The initial number of nodes must be at least 3. This will be fixed
WARNING: The initial number of nodes must be at least 3. This will be fixed
Re-normalizing counts by partition...

Dimensionality reduction¶

tl.dimensionality_reduction reduces the dimensionality of the data, by default, using Independent Component Analysis (ICA), as implemented in StabilizedICA. From the stable ICs, scycle then identifies the independent components related to cell cycle. Four possible dimensions are investigated: the two main known signatures of cell cycle, comprised of G1/S and G2/M, as well as 2 additional processes that are often but not always represented in the data. The Histones signature, which is comprised of histone genes, increases through S-phase and is stable through G2. The G2/M inhibitory (or G2-M-) signature related to a 'delayed' G2/M signal, which is still high at the start of G1 and whose transcripts start degrading later than transcripts forming the G2/M signature. This selection is performed by tl.dimensionality_reduction if method = 'ica' and find_cc_comps = True, which is default. It can be performed as a separate step invoking tl.find_cc_components.

In this example, we'll showcase an alternative dimensionality reduction methodology (method = 'self_consistent_CC), which starts from the approximation that uses the ICA followed by selection of cell-cycle components, and then uses the initial approximation of the cell cycle space to find genes related to cell cycle. These genes will be used as the reference for a principal component analysis, which will have a higher dimensionality space (n_comps-D) than the 3-4 D space found by the ICA. This has some benefits when calculating cell cycle trajectory later.

In [7]:
cc.tl.dimensionality_reduction(chla9, method = 'self_consistent_CC')
Dimensionality reduction using self consistent CC genes..
Initial estimation using ICA...
WARNING: The initial number of nodes must be at least 3. This will be fixed
Cell cycle genes initially found: 258
Top ones: ['H4C3', 'TOP2A', 'HMGB2', 'KPNA2', 'CCNB1', 'CDC20', 'UBE2C', 'H1-5', 'CKS2', 'H1-3', 'PLK1', 'DLGAP5', 'AURKA', 'KIF2C', 'NUF2', 'AURKB', 'TPX2', 'NUSAP1', 'CENPE', 'HMMR', 'TACC3', 'NCAPG', 'CENPA', 'AC027237.1', 'CDCA3', 'KIF20A']
WARNING: The initial number of nodes must be at least 3. This will be fixed


Iteration 1 ==================
Jaccard coeff: 0.9157509157509157 Old: 258 New: 265 
==================

WARNING: The initial number of nodes must be at least 3. This will be fixed


Iteration 2 ==================
Jaccard coeff: 0.9924812030075187 Old: 265 New: 265 
==================

Reducing dimensionality using self-consistent CC genes...

We can inspect our scycle results visually using the plotting module, but also by looking at the AnnData itself.

In [8]:
chla9
Out[8]:
AnnData object with n_obs × n_vars = 3813 × 10000
    obs: 'TotalUMIs', 'n_genes_by_counts', 'total_counts', 'total_counts_mt', 'pct_counts_mt', 'G1-S', 'G2-M', 'Histones', 'total_counts_raw'
    var: 'Accession', 'AccessionVersion', 'Aliases', 'CcdsID', 'Chromosome', 'ChromosomeEnd', 'ChromosomeStart', 'CosmicID', 'DnaBindingDomain', 'FullName', 'GeneType', 'HgncID', 'IsTF', 'Location', 'LocationSortable', 'LocusGroup', 'LocusType', 'MgdID', 'MirBaseID', 'OmimID', 'PubmedID', 'RefseqID', 'RgdID', 'UcscID', 'UniprotID', 'VegaID', 'mt', 'n_cells_by_counts', 'mean_counts', 'pct_dropout_by_counts', 'total_counts'
    uns: 'scycle', 'log1p', 'dimRed', 'P_dimRed'
    obsm: 'X_dimRed', 'X_pca_scycle'
    layers: 'matrix', 'spliced', 'unspliced', 'unnorm'

The previous steps have added multiple layers and variables to the AnnData. Here, we're going to focus on the obsm, which are dimensionality reductions. Feel free to take a look at other additions (e.g. chla9.uns['scycle'] will have the parameters for functions that have run)

In [9]:
chla9.obsm['X_dimRed'].shape, chla9.obsm['X_pca_scycle'].shape
Out[9]:
((3813, 30), (3813, 3))

In this case, X_dimRed is a 30-D PCA in the self-consistent cell-cycle genes, while X_pca_scycle is also a PCA with 3 PCs, which is used by the plotting module. scycle provides 2-D and 3-D visualizations using cc.pl.cell_cycle_pca and cc.pl.cell_cycle_pca3D

Trajectory and cell division¶

In [10]:
cc.pl.cell_cycle_pca(chla9)
Out[10]:
<ggplot: (8747263022135)>

From this image, we can clearly see the trajectory, but also the effect of the partition counts normalization that segments the counts. We can get the trajectory from the principal elastic circle estimation using elpigraph, implemented in scycle as tl.trajectory:

In [11]:
cc.tl.trajectory(chla9)
WARNING: The initial number of nodes must be at least 3. This will be fixed

Now we can look at the trajectory. We'll also plot it using the unormalized total_counts (total_counts_raw). We can still see a clear gradual increase in total counts, and the sudden drop-off at the moment of cell division (around node 6-7). Of note, when performing the partition normalization and using the self-consistent CC genes for dimensionality reduction, we often can find a "dimple" at the moment of cell division.

In [12]:
cc.pl.cell_cycle_pca(chla9, trajectory = True, col_var = 'total_counts_raw')
Out[12]:
<ggplot: (8747263263470)>
In [13]:
cc.pl.cell_cycle_pca3d(chla9, col_var = 'total_counts_raw', alpha = 0.5, trajectory = True)

tl.cell_division will attempt to find the moment of cell division using total_counts (in this case, the normalized counts). It will set the direction of the trajectory as the direction of increasing counts, and the moment of cell division as the largest dip in total counts from the previous node. In this way, we can "break" the circle into a linear trajectory, and calculate pseudotime.

In [14]:
cc.tl.cell_division(chla9)
cc.pl.cell_division(chla9)
Suggested moment of cell division: [8 9]
Direction of cell cycle: 1
Remapping edges using [8 9] ...
Out[14]:
<ggplot: (8747263241548)>

Pseudotime¶

In [15]:
cc.tl.pseudotime(chla9)
cc.pl.pseudotime_histogram(chla9)
Calculating pseudotimes for each cell...
Out[15]:
<ggplot: (8747311433653)>

We can see the distribution of cell over pseudotime, but now we can in fact plot many variables vs pseudotime in our data. For example, we can look at un-normalized counts:

In [16]:
cc.pl.pseudotime_scatter(chla9, 'total_counts_raw')
Out[16]:
<ggplot: (8747311408312)>

Or the expression of some selected genes:

In [17]:
cc.pl.pseudotime_scatter(chla9, ['H4C3', 'TOP2A', 'HMGB2'])
Out[17]:
<ggplot: (8747311440863)>

And, obviously, at the dynamics of cell cycle signatures over pseudotime. We could use pl.pseudotime_scatter again, but this is implemented into another call to pl.cell_cycle_scores.

In [18]:
cc.pl.cell_cycle_scores(chla9)
Out[18]:
<ggplot: (8747311369753)>

tl.cell_cycle_phase uses the curvature of the trajectory to find inflection points in the expression of genes that define the cell cycle and set boundaries between cell cycle phases. It set the boundaries at the points in the trajectory where the curvature is the highest.

In [19]:
cc.tl.cell_cycle_phase(chla9)
cc.pl.cell_cycle_scores(chla9)
-- Suggested cell cycle division:
G1:  0   - 0.36666666666666664
S:  0.36666666666666664 - 0.7
G2: 0.7 - 0.95
M:  0.95 -   1
/home/clarice/.local/lib/python3.8/site-packages/plotnine/layer.py:467: PlotnineWarning:

geom_vline : Removed 1 rows containing missing values.

Out[19]:
<ggplot: (8747311239033)>

We can check how the cells were assigned into cell cycle phase 'categories' by looking at the pieplot:

In [20]:
cc.pl.cell_cycle_phase_pieplot(chla9)
Out[20]:
([<matplotlib.patches.Wedge at 0x7f4a4639f4f0>,
  <matplotlib.patches.Wedge at 0x7f4a4641a2b0>,
  <matplotlib.patches.Wedge at 0x7f4a4641a490>,
  <matplotlib.patches.Wedge at 0x7f4a4641c2b0>],
 [Text(1.1350375709490241, 0.18490460387491608, 'M'),
  Text(0.45045981163186866, 1.0581048899351053, 'G2'),
  Text(-1.148687177338094, 0.054934220837676906, 'S'),
  Text(0.5675992918693599, -1.0001655082382122, 'G1')],
 [Text(0.5428440556712724, 0.08843263663582941, '5.1%'),
  Text(0.21543730121524152, 0.5060501647515722, '26.9%'),
  Text(-0.5493721282921318, 0.02627288822671504, '34.4%'),
  Text(0.27146053089404165, -0.47834002567914496, '33.6%')])

or plotting using our friend pl.cell_cycle_pca setting col_var = 'cell_cycle_phase':

In [21]:
from plotnine import scale_color_manual
(cc.pl.cell_cycle_pca(chla9, 'cell_cycle_phase') 
 + scale_color_manual(values = ['#eb89c2', '#8ca0c9', '#ff8d68', '#5cc2a6']))
Out[21]:
<ggplot: (8747311436189)>

Cell cycle and RNA velocity¶

As scycle is built on top of AnnData, it is compatible with many other Python libraries that are built on the same platform, such as scanpy and scvelo. Since our example was loaded from a loom file, we have access to the unspliced and spliced matrices, and can look at RNA velocity in the cell cycle space

In [23]:
import scvelo as sv
sv.pp.moments(chla9)
sv.tl.velocity(chla9)
WARNING: Did not normalize X as it looks processed already. To enforce normalization, set `enforce=True`.
Normalized count data: spliced, unspliced.
WARNING: You seem to have 11 duplicate cells in your data. Consider removing these via pp.remove_duplicate_cells.
computing neighbors
    finished (0:00:03) --> added 
    'distances' and 'connectivities', weighted adjacency matrices (adata.obsp)
computing moments based on connectivities
    finished (0:00:06) --> added 
    'Ms' and 'Mu', moments of un/spliced abundances (adata.layers)
computing velocities
    finished (0:00:05) --> added 
    'velocity', velocity vectors for each individual cell (adata.layers)
In [24]:
sv.tl.velocity_graph(chla9)
computing velocity graph
    finished (0:00:10) --> added 
    'velocity_graph', sparse matrix with cosine correlations (adata.uns)
In [25]:
sv.pl.velocity_embedding_stream(chla9, basis = 'X_dimRed', color = 'pseudotime')
computing velocity embedding
    finished (0:00:01) --> added
    'velocity_dimRed', embedded velocity vectors (adata.obsm)

We can see in the stream plot a clear, orthogonal confirmation of inferred pseudotime directionality through RNA velocity.

Next Previous

Built with MkDocs using a theme provided by Read the Docs.
« Previous Next »